Data source settings
After opening a data file you have to make a number of settings to make sure that the source data is interpreted and grouped the way you want. These settings are found on the Settings pane at the left.
Input data settings (Delimiters)
The Input Data settings (on the Settings pane at the left) specify how the input data must be interpreted. These settings are different for each data type. For a CSV file, for example, it is important to specify the delimiter that separates data fields. PDF files are already delimited naturally by pages, so the input data settings for PDF files are interpretation settings for text in the file.

For a CSV File
In a CSV file, data is read line by line, where each line can contain multiple fields, separated by a delimiter. Even though CSV stands for comma-separated values, fields may be separated by any character, such as a comma, tab, semicolon, or pipe.

For a PDF File
PDF files have a clear and fixed delimiter: pages. The Input Data settings are therefore not used to set delimiters. Instead, these options determine how words, lines and paragraphs are detected when you select content in the PDF to extract data from it.

For a database
Databases all return the same type of information. Therefore the Input Data options for a database refer to the tables inside the database. Clicking on any of the tables shows the first line of the data in that table.

For a text file
Because text files come in many different shapes and sizes, there are many input data settings for these files. For example, you can add or remove characters in lines if the file has a header you want to get rid of or strange characters at its beginning, or you can set a line width if you are still working with old line printer data.

For an XML file
XML is a special file format because XML files can have a theoretically unlimited number of structure types. The input data settings have two simple options that determine at which node level a new record is created. You can either select an element type, to create a new delimiter every time that element is encountered, or choose to use the root node. If there is only one top-level element, there will only be one record before the Boundaries are set.

Record boundaries
Boundaries are the division between records: they define where one record ends and the next record begins. Using boundaries, you can organize the data the way you want. To set a boundary, a specific trigger must be defined. For an explanation of all Boundaries options per file type, see Boundaries.

Data format settings
By default the data type of extracted data is a String, but each field in the Data Model can be set to contain another data type (see Data types). When that data type is Date, Number or Currency, the DataMapper will expect the data in the data source to be formatted in a certain way, depending on the settings.

The default format for dates, numbers and currencies can be set in three places: in the user preferences, in the data source settings, and per field in the Data Model.

By default, the user preferences are set to the system preferences. These user preferences become the default format values for any newly created data mapping configuration. To change these preferences, select Window > Preferences > DataMapper > DataMapper default format (see Datamapper preferences).

Data format settings defined in the data source settings apply to any new extraction made in the current data mapping configuration. These settings are made on the Settings pane; see Settings pane.

Settings for a field that contains extracted data are made via the properties of the Extract step that the field belongs to (see Setting the data type). Any format settings specified per field are always used, regardless of the user preferences or data source settings.

Data format settings tell the DataMapper how certain types of data are formatted in the data source. They don't determine how these data are formatted in the Data Model or in a template. In the Data Model, data are converted to the native data type.
Dates, for example, are converted to a DateTime object in the Data Model, and will always be shown as "year-month-day" plus the time stamp, for example: 2012-04-11 12.00 AM.
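
The idea that format settings describe the source text, not the stored value, can be illustrated with a small Python sketch. This is not DataMapper code: the dictionaries, the function names (`resolve_format`, `parse_date`, `parse_number`) and the sample formats are all hypothetical, and the three levels are modeled as a simple fallback chain, which is a simplification of how defaults are passed on when a configuration or extraction is created.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical format settings, from least to most specific.
USER_PREFERENCES = {"date": "%m/%d/%Y", "decimal_sep": ".", "thousands_sep": ","}
DATA_SOURCE_SETTINGS = {"date": "%d-%m-%Y", "decimal_sep": ",", "thousands_sep": "."}


def resolve_format(key, field_settings=None):
    """Pick the most specific setting: field, then data source, then user preferences."""
    for level in (field_settings or {}, DATA_SOURCE_SETTINGS, USER_PREFERENCES):
        if key in level:
            return level[key]
    raise KeyError(key)


def parse_date(text, field_settings=None):
    """Interpret source text as a date; the result is a native datetime object."""
    return datetime.strptime(text, resolve_format("date", field_settings))


def parse_number(text, field_settings=None):
    """Interpret source text as a number; the result is a native Decimal."""
    thousands = resolve_format("thousands_sep", field_settings)
    decimal = resolve_format("decimal_sep", field_settings)
    # Drop grouping characters first, then normalize the decimal separator.
    normalized = text.replace(thousands, "").replace(decimal, ".")
    return Decimal(normalized)
```

With these assumed settings, `parse_date("11-04-2012")` and `parse_date("04/11/2012", {"date": "%m/%d/%Y"})` both yield the same native value, 11 April 2012, even though the source text differs. How that value is displayed afterwards is a separate matter that the source format does not control.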